
Enable code for dynamic parallelism #96

Open

thedodd wants to merge 3 commits into master

Conversation

@thedodd (Contributor) commented Nov 15, 2022

@thedodd (Contributor, Author) commented Nov 20, 2022

So, interestingly, I'm running into an issue where the generated code cannot be loaded by Module::from_ptx; it returns the error `a PTX JIT compilation failed`.

Some background on current testing:

  • I've put together a reference C++ program which uses dynamic parallelism (ultra simple).
  • I can execute the reference C++ program and all is good: expected output and behavior.
  • I also have a reference Rust program which attempts to use this updated code for dynamic parallelism, with the exact same functionality and data types (fixed-size types on the C++ side).
  • When I compare the PTX between the two programs, it is nearly identical.

Now, what is quite strange is that if I copy the PTX from the working C++ program over to the Rust program (disabling PTX gen in the Rust program to ensure the C++ PTX is not overwritten), the Rust program aborts with that same `a PTX JIT compilation failed` error.

  • According to ptxas, both PTX files are valid and compile to object code (ptxas -c ...).
  • This issue is triggered even by attempting to construct a stream device-side.
    • Note that in my tests to narrow this down, I've removed stream construction and am just passing a null stream to the CUDA launch call on the device.
    • It is just interesting that the module loader does not like the stream construction or the launch.

So, I am wondering:

  • Is there something intrinsically wrong with calling cuda::cuModuleLoadDataEx when the PTX uses dynamic parallelism?
  • Is there a way we can bypass this?
  • That is where my experimentation currently stands.

@thedodd (Contributor, Author) commented Nov 20, 2022

Perhaps we need to manually construct a linker, link the PTX against cudadevrt, and then compile the result to a cubin. Will try that.

@thedodd (Contributor, Author) commented Nov 20, 2022

Yea, that was it. We need to create a linker, add the PTX, and add libcudadevrt (right now I have the path hard-coded, but I need to create a dynamic search mechanism, as I don't think the CUDA linker will do this on its own ... we'll see).

From there, I was able to successfully execute the PTX from my sample C++ app. The generated Rust PTX still has an invalid memory access taking place, and it looks like it is coming from how the parameter buffer is being populated. This is still a step forward, as the code gen is much easier to fix: I at least know what I'm dealing with, instead of an opaque "JIT compilation failed" error.
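For reference, the driver-API sequence behind "create a linker, add the PTX, add libcudadevrt, load the cubin" can be sketched as below. This is a hedged illustration, not the code in this PR: error handling is minimal, the libcudadevrt path is an assumption, and `load_dynpar_ptx` is a hypothetical helper name.

```c
#include <cuda.h>
#include <stddef.h>
#include <string.h>

/* Sketch: JIT-link PTX that uses dynamic parallelism against the CUDA
 * device runtime, then load the resulting cubin. Without the
 * CU_JIT_INPUT_LIBRARY step, loading the PTX directly fails with
 * "a PTX JIT compilation failed". */
static CUresult load_dynpar_ptx(const char *ptx, CUmodule *module) {
    CUlinkState link;
    void *cubin;
    size_t cubin_size;
    CUresult rc;

    rc = cuLinkCreate(0, NULL, NULL, &link);
    if (rc != CUDA_SUCCESS) return rc;

    /* Link the device runtime; the path here is an assumption. */
    rc = cuLinkAddFile(link, CU_JIT_INPUT_LIBRARY,
                       "/usr/local/cuda/lib64/libcudadevrt.a",
                       0, NULL, NULL);
    if (rc == CUDA_SUCCESS)
        rc = cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)ptx,
                           strlen(ptx) + 1, "kernel.ptx", 0, NULL, NULL);
    if (rc == CUDA_SUCCESS)
        rc = cuLinkComplete(link, &cubin, &cubin_size);
    if (rc == CUDA_SUCCESS)
        rc = cuModuleLoadData(module, cubin); /* load the cubin, not the PTX */

    /* The cubin buffer is owned by the link state, so destroy it only
     * after the module has been loaded. */
    cuLinkDestroy(link);
    return rc;
}
```

The key point is that cuLinkComplete produces a fully linked cubin, which sidesteps the restriction that hit cuModuleLoadDataEx when fed unlinked dynamic-parallelism PTX.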

@thedodd (Contributor, Author) commented Nov 20, 2022

Yea, that did it. Code gen is far from optimal for loading the param buffer, but it works, and I am able to successfully use dynamic parallelism from the Rust-generated PTX end to end, with the expected output and behavior.

Macro codegen for populating the buffer can be optimized further, as the generated PTX is not optimal. I'll focus on that later.
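For anyone following along, the bookkeeping such codegen has to get right can be modeled host-side. This is a sketch of the general layout rule for a device-side launch parameter buffer (each argument placed at an offset aligned to its own alignment), not the PR's actual macro output; `param_offsets` is a hypothetical helper.

```rust
/// Align `offset` up to the next multiple of `align` (power of two).
fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) & !(align - 1)
}

/// Compute the offset at which each argument lands in the parameter
/// buffer, given (size, align) pairs. A wrong offset here is exactly
/// the kind of bug that shows up as an invalid memory access on device.
fn param_offsets(args: &[(usize, usize)]) -> Vec<usize> {
    let mut offset = 0;
    let mut out = Vec::with_capacity(args.len());
    for &(size, align) in args {
        offset = align_up(offset, align);
        out.push(offset);
        offset += size;
    }
    out
}

fn main() {
    // e.g. (u32, f64, u8, pointer) on a 64-bit target:
    let offsets = param_offsets(&[(4, 4), (8, 8), (1, 1), (8, 8)]);
    println!("{:?}", offsets); // [0, 8, 16, 24]
}
```

Note the padding between the u32 and the f64, and after the u8: naively packing arguments back to back would misalign every argument after the first mismatch.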

@@ -114,6 +114,28 @@ impl Linker {
        }
    }

    /// Link device runtime lib.
    pub fn add_libcudadevrt(&mut self) -> CudaResult<()> {
        let mut bytes = std::fs::read("/usr/local/cuda-11/lib64/libcudadevrt.a")
When this PR is finalized, this hard-coded path should maybe be replaced by searching CUDA_PATH? Not sure what the proper way is.
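One possible shape for that search: honor CUDA_PATH first, then fall back to common install prefixes. A minimal sketch, assuming a helper split out so the filesystem probe is testable; the function names and the fallback prefix list are assumptions, not anything this PR defines.

```rust
use std::path::{Path, PathBuf};

/// Return the first existing `libcudadevrt.a` under the given prefixes,
/// checking the usual `lib64` and `lib` subdirectories.
fn find_in_prefixes<P: AsRef<Path>>(prefixes: &[P]) -> Option<PathBuf> {
    for prefix in prefixes {
        for libdir in ["lib64", "lib"] {
            let candidate = prefix.as_ref().join(libdir).join("libcudadevrt.a");
            if candidate.is_file() {
                return Some(candidate);
            }
        }
    }
    None
}

/// Build the prefix list: $CUDA_PATH first, then common install roots.
fn cuda_prefixes() -> Vec<PathBuf> {
    let mut prefixes = Vec::new();
    if let Ok(p) = std::env::var("CUDA_PATH") {
        prefixes.push(PathBuf::from(p));
    }
    prefixes.push(PathBuf::from("/usr/local/cuda"));
    prefixes.push(PathBuf::from("/opt/cuda"));
    prefixes
}

fn main() {
    match find_in_prefixes(&cuda_prefixes()) {
        Some(path) => println!("found libcudadevrt at {}", path.display()),
        None => eprintln!("libcudadevrt.a not found; set CUDA_PATH"),
    }
}
```

Versioned install dirs (like the `/usr/local/cuda-11` currently hard-coded) could be added to the fallback list, but the unversioned `/usr/local/cuda` symlink usually covers them.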

Successfully merging this pull request may close these issues.

Dynamic Parallelism | implementation strategy